
2011 ebi industry workshop


Page 1: 2011 ebi industry workshop

1 2011-EBI-Industry-SW::Dumontier

Predicting Druglikeness and Toxicity from Integrated Data and Services on the Life Science Semantic Web

Michel Dumontier, Ph.D.

Associate Professor of Bioinformatics, Department of Biology, School of Computer Science, Institute of Biochemistry, Carleton University

Professeur Associé, Département d’informatique et de génie logiciel, Université Laval

Ottawa Institute of Systems Biology

Ottawa-Carleton Institute of Biomedical Engineering

Page 2: 2011 ebi industry workshop


Is caffeine a drug-like molecule?

Is acetaminophen toxic?

Page 3: 2011 ebi industry workshop


Finding the right information to answer a question is hard and sometimes requires a sophisticated workflow

Page 4: 2011 ebi industry workshop


Page 5: 2011 ebi industry workshop


What if we could answer a question by automatically building a knowledge base using both data and services?

Page 6: 2011 ebi industry workshop


The Semantic Web is a web of knowledge.

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It enables the answering of sophisticated questions

Page 7: 2011 ebi industry workshop


Is caffeine a drug-like molecule?

To answer this question we need:

• a definition of what ‘drug-like molecule’ really means
• caffeine’s molecular structure
• the ability to compute the relevant attributes
• a way to determine whether caffeine satisfies the requirements of being ‘drug-like’

Page 8: 2011 ebi industry workshop


Lipinski Rule of Five

• Rule of thumb for druglikeness (orally active in humans), made up of 4 rules with multiples of 5:
– mass of less than 500 Daltons
– no more than 5 hydrogen bond donors
– no more than 10 hydrogen bond acceptors
– a partition coefficient value between -5 and 5

We need a more formal (machine understandable) description of a ‘drug-like molecule’ which specifies values for chemical descriptors
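As a plain-code counterpart to such a formal description, the four rules can be sketched as a predicate over precomputed descriptors. This is only an illustration, not the OWL definition used in the talk: the `Descriptors` class and `is_lipinski_druglike` function are hypothetical names, the donor and acceptor limits are treated as inclusive (an assumption), and the caffeine values are illustrative inputs.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Chemical descriptors needed by the rule (hypothetical container)."""
    mass_da: float  # molecular mass in Daltons
    hbd: int        # hydrogen bond donor count
    hba: int        # hydrogen bond acceptor count
    logp: float     # partition coefficient (log P)


def is_lipinski_druglike(d: Descriptors) -> bool:
    """Return True when all four rules hold.

    Assumption: donor/acceptor limits are inclusive; the log P
    window of -5 to 5 follows the slide.
    """
    return (
        d.mass_da < 500
        and d.hbd <= 5
        and d.hba <= 10
        and -5 < d.logp < 5
    )


# Illustrative descriptor values for caffeine (assumed inputs, not service output).
caffeine = Descriptors(mass_da=194.19, hbd=0, hba=6, logp=-0.43)
print(is_lipinski_druglike(caffeine))  # True
```

A hard-coded predicate like this cannot be discovered, shared or reasoned over, which is the gap the ontology-based description is meant to fill.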

Page 9: 2011 ebi industry workshop


Ontology as a strategy to formally represent knowledge

Page 10: 2011 ebi industry workshop


The Web Ontology Language (OWL) Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way

Page 11: 2011 ebi industry workshop


Semanticscience Integrated Ontology (SIO)

• OWL2 ontology
• 900+ classes covering basic types (physical, processual, abstract, informational) with an emphasis on biological entities
• 169 basic relations (mereological, participatory, attribute/quality, spatial, temporal and representational)
• axioms can be used by reasoners to generate inferences for consistency checking, classification and answering questions about life science knowledge
• embodies emerging ontology design patterns; specifies the representation of knowledge
• dereferenceable URIs
• searchable in the NCBO BioPortal
• available at http://semanticscience.org/ontology/sio.owl

Page 12: 2011 ebi industry workshop


Page 13: 2011 ebi industry workshop


The Chemical Information Ontology (CHEMINF)

• 100+ chemical descriptors
• 50+ chemical qualities
• Relates descriptors to their specifications and to the software that generated them (along with the running parameters and the algorithms that they implement)
• Contributors: Nico Adams, Leonid Chepelev, Michel Dumontier, Janna Hastings, Egon Willighagen, Peter Murray-Rust, Christoph Steinbeck

http://semanticchemistry.googlecode.com

Page 14: 2011 ebi industry workshop


Molecular structure can be represented using a SMILES string, which is a common representation of the chemical graph

ball & stick model for caffeine

SMILES string for caffeine

Cn1cnc2n(C)c(=O)n(C)c(=O)c12
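To make the "string as chemical graph" idea concrete, here is a minimal sketch that pulls the heavy atoms out of the caffeine SMILES string with a regular expression. It is not a SMILES parser: it assumes only organic-subset atoms with no bracket atoms such as `[Na+]`, it ignores bonds, rings and implicit hydrogens, and `heavy_atom_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Organic-subset atom symbols only. Two-letter halogens are listed first
# so that "Cl" is not read as carbon; lowercase letters are aromatic atoms.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a bracket-free SMILES string."""
    return Counter(symbol.capitalize() for symbol in ATOM.findall(smiles))


counts = heavy_atom_counts("Cn1cnc2n(C)c(=O)n(C)c(=O)c12")
print(dict(counts))  # {'C': 8, 'N': 4, 'O': 2}
```

The counts agree with caffeine's formula C8H10N4O2 once hydrogens are left out. A real descriptor service would use a cheminformatics toolkit rather than a regular expression.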

Page 15: 2011 ebi industry workshop


Lipinski Rule of Five

• Empirically derived ruleset for druglikeness (4 rules with multiples of 5):
– mass of less than 500 Daltons
– no more than 5 hydrogen bond donors
– no more than 10 hydrogen bond acceptors
– a partition coefficient value between -5 and 5

• A formal description using OWL:

Page 16: 2011 ebi industry workshop


What we then need are services that will consume SMILES strings and annotate the molecule with the required chemical descriptors.

Then we can reason about whether it satisfies the drug-likeness definition.

Page 17: 2011 ebi industry workshop


Semantic Automated Discovery and Integration

http://sadiframework.org

Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB

SADI is a framework to create Semantic Web services using OWL classes as service inputs and outputs


Page 18: 2011 ebi industry workshop


Create code stubs using the ontology

• Publish the ontology to a web-accessible location
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl

• Make sure that the class names are resolvable (easy when using the hash notation)
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smiles-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#logp-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hbdc-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#hdba-molecule
http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#lipinksi-druglike-molecule

• Download/checkout the code
http://sadiframework.org

• Run the code generator (Java, Perl, Python)
– specify the URIs that correspond to input and output types

• Implement the functionality
– We used the Chemistry Development Kit (CDK) to implement 4 services
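On the hash-notation point in the list above: a client fetches only the part of a URI before the `#`, so every class defined in the published ontology file resolves as soon as that one file is reachable. The short sketch below illustrates this with Python's standard library, using one of the class URIs listed above.

```python
from urllib.parse import urldefrag

class_uri = (
    "http://semanticscience.org/sadi/ontology/"
    "lipinskiserviceontology.owl#smiles-molecule"
)

# urldefrag splits a URI into the document to fetch and the fragment
# that names something inside that document.
document_url, fragment = urldefrag(class_uri)

print(document_url)  # http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl
print(fragment)      # smiles-molecule
```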


Page 19: 2011 ebi industry workshop


A SADI service responds to a GET operation by providing the service description in RDF

The description conforms to Feta (BioMoby, myGrid)

curl http://cbrass.biordf.net/logpdc/logpc

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://www.mygrid.org.uk/mygrid-moby-service#">
  <rdf:Description rdf:about="">
    <j.0:hasServiceDescriptionText>no description</j.0:hasServiceDescriptionText>
    <j.0:hasServiceNameText rdf:datatype="http://www.w3.org/2001/XMLSchema#string">logpc</j.0:hasServiceNameText>
    <j.0:hasOperation rdf:resource="#operation"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription"/>
  </rdf:Description>
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#smilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
  <rdf:Description rdf:about="#operation">
    <j.0:outputParameter rdf:resource="#output"/>
    <j.0:inputParameter rdf:resource="#input"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#operation"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#alogpsmilesmolecule"/>
    <rdf:type rdf:resource="http://www.mygrid.org.uk/mygrid-moby-service#parameter"/>
  </rdf:Description>
</rdf:RDF>
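A client mainly needs two things from this description: the OWL class the service consumes and the class it produces. The sketch below reads both from an abbreviated copy of the description, using only the standard library. It treats the RDF/XML as plain XML, which works for this particular layout but is an assumption; a real client would use an RDF parser. `parameter_types` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MYGRID = "http://www.mygrid.org.uk/mygrid-moby-service#"
ONT = "http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"

# Abbreviated copy of the service description shown on the slide.
DESCRIPTION = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:j.0="{MYGRID}">
  <rdf:Description rdf:about="#input">
    <j.0:objectType rdf:resource="{ONT}smilesmolecule"/>
  </rdf:Description>
  <rdf:Description rdf:about="#operation">
    <j.0:outputParameter rdf:resource="#output"/>
    <j.0:inputParameter rdf:resource="#input"/>
  </rdf:Description>
  <rdf:Description rdf:about="#output">
    <j.0:objectType rdf:resource="{ONT}alogpsmilesmolecule"/>
  </rdf:Description>
</rdf:RDF>"""


def parameter_types(rdf_xml: str) -> dict:
    """Map 'input' and 'output' to the OWL class URIs named in the description."""
    root = ET.fromstring(rdf_xml)
    types = {}
    for node in root.findall(f"{{{RDF}}}Description"):
        about = node.get(f"{{{RDF}}}about")
        object_type = node.find(f"{{{MYGRID}}}objectType")
        if about in ("#input", "#output") and object_type is not None:
            types[about.lstrip("#")] = object_type.get(f"{{{RDF}}}resource")
    return types


types = parameter_types(DESCRIPTION)
print(types["input"])
print(types["output"])
```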

Page 20: 2011 ebi industry workshop


Responds to a POST containing service input with a service output in RDF

The query is in RDF:

<rdf:RDF
    xmlns="http://semanticscience.org/sadi/ontology/caffeine.rdf#"
    xmlns:so="http://semanticscience.org/sadi/ontology/lipinskiserviceontology.owl#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sio="http://semanticscience.org/resource/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <so:smilesmolecule rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#m">
    <sio:SIO_000008 rdf:resource="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles"/>
  </so:smilesmolecule>
  <sio:CHEMINF_000018 rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#msmiles">
    <sio:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Cn1cnc2n(C)c(=O)n(C)c(=O)c12</sio:SIO_000300>
  </sio:CHEMINF_000018>
</rdf:RDF>

The response is in RDF:

<rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
  <rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
  <j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
</rdf:Description>
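The exchange can be sketched from the client side as follows: build a POST request that carries the RDF input, then read the computed value out of the RDF response. Several details are assumptions rather than facts from the slides: the `application/rdf+xml` content type, the enclosing `rdf:RDF` element around the response fragment, and the binding of the `j.0` prefix to the SIO namespace. The request is built but not sent, and `build_request` and `descriptor_values` are hypothetical helper names.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SIO = "http://semanticscience.org/resource/"
SERVICE_URL = "http://cbrass.biordf.net/logpdc/logpc"  # endpoint from the curl example


def build_request(service_url: str, rdf_xml: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying RDF/XML input."""
    return urllib.request.Request(
        service_url,
        data=rdf_xml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},  # assumed media type
        method="POST",
    )


# Response fragment from the slide, wrapped in rdf:RDF so it parses on its own.
# Assumption: the j.0 prefix is bound to the SIO namespace.
RESPONSE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:j.0="{SIO}">
  <rdf:Description rdf:about="http://semanticscience.org/sadi/ontology/caffeine.rdf#mdalogp">
    <rdf:type rdf:resource="http://semanticscience.org/resource/CHEMINF_000251"/>
    <j.0:SIO_000300 rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-0.4311000000000006</j.0:SIO_000300>
  </rdf:Description>
</rdf:RDF>"""


def descriptor_values(rdf_xml: str) -> dict:
    """Map each described resource to its SIO_000300 value as a float."""
    root = ET.fromstring(rdf_xml)
    values = {}
    for node in root.findall(f"{{{RDF}}}Description"):
        value = node.find(f"{{{SIO}}}SIO_000300")
        if value is not None:
            values[node.get(f"{{{RDF}}}about")] = float(value.text)
    return values


request = build_request(SERVICE_URL, "<rdf:RDF/>")  # placeholder body for illustration
values = descriptor_values(RESPONSE)
print(request.get_method(), values)
```

Sending the request would be a call to `urllib.request.urlopen(request)`; it is left out here because the availability of the endpoint is not something this sketch can assume.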

Page 21: 2011 ebi industry workshop


61 Chemical Semantic Web Services

• these and an increasing number of semantic web services are registered at http://sadiframework.org/registry/services/

Page 22: 2011 ebi industry workshop

Now what?

Page 23: 2011 ebi industry workshop


Semantic Health and Research Environment

SHARE is an application that executes (SPARQL) queries as workflows over SADI services

Page 24: 2011 ebi industry workshop


“Reckoning”

Dynamic discovery of instances of OWL classes through synthesis and invocation of a Web Service workflow capable of generating data described by the OWL class restrictions, followed by reasoning to classify the data into that ontology
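The control flow behind this definition can be sketched as a toy loop: read the restrictions that define a class, invoke only the services able to produce the data those restrictions mention, then classify. Everything here is a hypothetical stand-in. The registry entries are local stubs returning placeholder values instead of SADI services, the restrictions are Python predicates instead of OWL axioms, and the final `all(...)` stands in for a reasoner.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: descriptor name -> stub standing in for a web service.
REGISTRY: Dict[str, Callable[[str], float]] = {
    "mass": lambda smiles: 194.19,         # placeholder value
    "hbd": lambda smiles: 0,               # placeholder value
    "hba": lambda smiles: 6,               # placeholder value
    "logp": lambda smiles: -0.43,          # placeholder value
    "rotatable_bonds": lambda smiles: 0,   # registered, but not needed below
}

# Hypothetical class restrictions: descriptor name -> condition on its value.
DRUGLIKE: Dict[str, Callable[[float], bool]] = {
    "mass": lambda v: v < 500,
    "hbd": lambda v: v <= 5,
    "hba": lambda v: v <= 10,
    "logp": lambda v: -5 < v < 5,
}


def reckon(
    smiles: str,
    restrictions: Dict[str, Callable[[float], bool]],
    registry: Dict[str, Callable[[str], float]],
) -> Tuple[bool, List[str]]:
    """Invoke only the services the restrictions need, then classify."""
    facts: Dict[str, float] = {}
    invoked: List[str] = []
    for descriptor in restrictions:
        if descriptor not in registry:
            raise LookupError(f"no service produces {descriptor!r}")
        facts[descriptor] = registry[descriptor](smiles)
        invoked.append(descriptor)
    is_member = all(test(facts[name]) for name, test in restrictions.items())
    return is_member, invoked


member, used = reckon("Cn1cnc2n(C)c(=O)n(C)c(=O)c12", DRUGLIKE, REGISTRY)
print(member, used)
```

The point of the sketch is that the class definition drives which services run: `rotatable_bonds` is registered but never invoked because no restriction asks for it.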


Page 25: 2011 ebi industry workshop

ChEBI publishes (non-SW) data!

Page 26: 2011 ebi industry workshop

Bio2RDF provides ChEBI in RDF

Page 27: 2011 ebi industry workshop


Bio2RDF covers the major biological databases

Page 28: 2011 ebi industry workshop


Bio2RDF’s RDFized data fits together

Page 29: 2011 ebi industry workshop


Resource Description Framework (RDF)

Allows one to talk about anything

Uniform Resource Identifiers (URIs) can be used as entity names

Bio2RDF specifies the naming convention

http://bio2rdf.org/uniprot:P05067 (uniprot:P05067) is a name for Amyloid precursor protein

http://bio2rdf.org/omim:104300 (omim:104300) is a name for Alzheimer disease
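The naming convention in the two examples above is a fixed base followed by `namespace:identifier`. A minimal sketch of building such names, assuming the pattern generalises exactly as those two URIs suggest (`bio2rdf_uri` is a hypothetical helper, not part of Bio2RDF):

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a Bio2RDF-style URI from a dataset namespace and a record identifier."""
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


print(bio2rdf_uri("uniprot", "P05067"))  # http://bio2rdf.org/uniprot:P05067
print(bio2rdf_uri("omim", "104300"))     # http://bio2rdf.org/omim:104300
```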

Page 30: 2011 ebi industry workshop


Life Science Dataset Registry Coordinates Naming

• Provides stable URI patterns for records and the entities they describe.

Directory Service
• ~1500 datasets & dozens of resolvers.

Discovery Service
• Registry links entities to records and their representations (RDF/XML, HTML, etc.) and providers (Bio2RDF, UniProt)

Redirection Service
• Automatic redirection to data provider document


Page 31: 2011 ebi industry workshop


Bio2RDF is now serving over 40 billion triples of linked biological data

Page 32: 2011 ebi industry workshop


Bio2RDF is a framework to create and provision linked data networks

Francois Belleau, Laval University
Marc-Alexandre Nolin, Laval University
Peter Ansell, Queensland University of Technology
Michel Dumontier, Carleton University

Page 33: 2011 ebi industry workshop


Bio2RDF is part of a growing web of linked data

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Page 34: 2011 ebi industry workshop


Something you can look up or search for, with rich descriptions

Page 35: 2011 ebi industry workshop


SPARQL is the new cool kid on the query block

SQL SPARQL

Page 36: 2011 ebi industry workshop

Query for log p

Page 37: 2011 ebi industry workshop


Page 38: 2011 ebi industry workshop

Query: Is caffeine a drug-like molecule?

Page 39: 2011 ebi industry workshop


Page 40: 2011 ebi industry workshop


Benefits

• Data remains distributed, as the internet was meant to be!

• Data is not “exposed” as a SPARQL endpoint
– greater provider control over computational resources

• Service invocation is straightforward, and matchmaking is done by reasoning about ontology-based input/output descriptions

Page 41: 2011 ebi industry workshop


Is acetaminophen toxic?

• Classical approaches involve decision trees or machine learning over validated data.

• Algorithms are often proprietary, even those used by the regulatory agencies

• Open issues: which data were used, what the informative parameters are, and how easily new information can affect the outcomes

Page 42: 2011 ebi industry workshop


OWLED2011 : Large-Scale Boolean Feature Based Trees as OWL ontologies

Page 43: 2011 ebi industry workshop


DL Reasoners give Explanations
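To illustrate why a tree over Boolean features lends itself to explanation, the sketch below classifies an input and returns the sequence of feature tests that led to the decision. It is plain Python, not an OWL ontology or a DL reasoner, and the tree, feature names and labels are hypothetical placeholders rather than a real toxicity model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], features: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Walk the tree, recording each feature test as part of the explanation."""
    path: List[str] = []
    node = tree
    while isinstance(node, Node):
        present = features.get(node.feature, False)  # absent features count as False
        path.append(f"{node.feature} = {present}")
        node = node.if_true if present else node.if_false
    return node.label, path


# Hypothetical two-level tree over placeholder features.
TREE = Node(
    "has_feature_A",
    Node("has_feature_B", Leaf("flagged"), Leaf("not flagged")),
    Leaf("not flagged"),
)

label, explanation = classify(TREE, {"has_feature_A": True, "has_feature_B": False})
print(label)
print(explanation)
```

Encoding the same tree as OWL class axioms lets a DL reasoner produce the equivalent justification from the ontology itself, which is the kind of transparency the slides point to.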

Page 44: 2011 ebi industry workshop


Summary

• Semantic Web technologies offer a tantalizing ability to create and share data and services for drug discovery
– Bio2RDF provides linked life science data
– SADI is a framework for providing semantic web services
– SHARE allows us to simultaneously query and reason about data and services represented using RDF/OWL
– Expressive ontologies can be used to make toxicity decisions transparent

Page 45: 2011 ebi industry workshop


Acknowledgements

Bio2RDF: Peter Ansell, Francois Belleau, Allison Callahan, Jacques Corbeil, Jose Cruz-Toledo, Alex De Leon, Steve Etlinger, James Hogan, Nichealla Keath, Jean Morissette, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Paul Roe

SADI: Christopher Baker, Melanie Courtot, Jose Cruz-Toledo, Steve Etlinger, Nichealla Keath, Artjom Klein, Luke McCarthy, Silvane Paixao, Ben Vandervalk, Natalia Villanueva-Rosales, Mark Wilkinson

CHEMINF Group: Leo Chepelev, Janna Hastings, Egon Willighagen, Nico Adams

Toxicity Group: Leo Chepelev, Dana Klassen

Page 46: 2011 ebi industry workshop



Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier