Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project


Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.alasdairjggray.co.uk
@gray_alasdair

http://c745.r45.cf2.rackcdn.com/img/2009/lens_filter_coasters.jpg

Open PHACTS Use Case

“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”

• Chemical Properties (Chemspider)
• Launched drugs (Drugbank)
• Human => Mouse (Homologene)
• Protein Families (Enzyme)
• Bioactivity Data (ChEMBL)
• … other info (Uniprot/Entrez etc.)


21/05/2014 Brighton Seminar 2

Literature, PubChem, Genbank, Patents, Databases, Downloads

Data Integration, Data Analysis, Firewalled Databases

Repeat @ each company

Lowering industry firewalls: pre-competitive informatics in drug discovery Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944

A single, shared solution.

Funded under
• IMI: 2011-14
• ENSO: 2014-16

Pre-competitive Informatics

Open PHACTS Discovery Platform


Drug Discovery Platform

Apps

Domain API

Interactive responses

Production quality integration platform

Method Calls

(April 2013 – March 2014)

15.8 million total hits

API Hits

An “App Store”?

http://www.openphactsfoundation.org/apps.html

Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium

MOE Collector Cytophacts Utopia Garfield SciBite

KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna

Drug

Disease

Pathway

Target

https://dev.openphacts.org/

Linked Data API


OPS Discovery Platform

Components shown in the architecture diagram:
• Apps
• Linked Data API (RDF/XML, TTL, JSON)
• Domain Specific Services
• Core Platform
• Semantic Workflow Engine
• Identity Resolution Service
• Identifier Management Service
• Chemistry Registration, Normalisation & Q/C
• Indexing
• Data Cache (Virtuoso Triple Store)
• Data sources: Db, RDF, Nanopub, each with VoID
• Public Content, Commercial, Public Ontologies, User Annotations
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

Platform Interaction

Provenance

Multiple Identities

Andy Law's Third Law

“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”

http://bioinformatics.roslin.ac.uk/lawslaws/


P12047, X31045, P12047, GB:29384, RS_2353

Are these the same thing?

Gleevec® = Imatinib Mesylate


Drugbank, ChemSpider, PubChem

Imatinib, Mesylate, Imatinib Mesylate, YLMAHDNUQAMNNX-UHFFFAOYSA-N


Multiple Links: Different Reasons


Link: skos:closeMatch
Reason: non-salt form

Link: skos:exactMatch
Reason: drug name

Dynamic Equality

Strict (Analysing) to Relaxed (Browsing)

Strict: skos:exactMatch (InChI)

Relaxed: skos:exactMatch (InChI) and skos:closeMatch (Drug Name)
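The idea of dynamic equality can be sketched in Python, assuming a lens is nothing more than the set of link justifications it accepts. The lens names, the `Link` type, the `equivalents` helper and the `ex:` identifiers are hypothetical; only the predicates and the two reasons (InChI, Drug Name) come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. skos:exactMatch or skos:closeMatch
    justification: str  # the reason the link was asserted


# Hypothetical lenses: each lists the justifications it accepts.
LENSES = {
    "strict": {"InChI"},
    "relaxed": {"InChI", "Drug Name"},
}

# Illustrative links only; the identifiers are placeholders.
LINKS = [
    Link("ex:compoundA", "ex:compoundB", "skos:exactMatch", "InChI"),
    Link("ex:compoundA", "ex:compoundC", "skos:closeMatch", "Drug Name"),
]


def equivalents(uri, lens, links=LINKS):
    """Return the URIs treated as equal to `uri` under a lens (one hop only)."""
    accepted = LENSES[lens]
    found = {uri}
    for link in links:
        if link.justification not in accepted:
            continue  # this lens does not trust the reason behind the link
        if link.source == uri:
            found.add(link.target)
        elif link.target == uri:
            found.add(link.source)
    return found


strict_view = equivalents("ex:compoundA", "strict")
relaxed_view = equivalents("ex:compoundA", "relaxed")
```

Under these assumptions the same data yields a smaller equivalence set for analysis and a larger one for browsing, without changing the stored links.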

Initial Connectivity


Datasets 37

Linksets 104

Links 7,096,712

Justifications 7

Compound Information

Genes == Proteins?

BRCA1

Breast cancer type 1 susceptibility protein

http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png

http://en.wikipedia.org/wiki/File:BRCA1_en.png

Proceed with Caution!


Co-reference Computation

Rules ensure
• Unrestricted transitivity within conceptual type
• Restrict crossing conceptual types

Based on justifications

Provenance captured

[Diagram: Target, Protein and Gene, linked with multiplicities 0..* and 0..1]
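One possible reading of these rules, as a minimal Python sketch: links are chained transitively only when both ends share a conceptual type and the links carry the same justification, while links that cross types are kept as asserted but never chained through. The function name, the type labels, the "encodes" justification and the example URIs are assumptions for illustration, and the "asserted"/"inferred" label is only a crude stand-in for the provenance the platform captures.

```python
from collections import defaultdict
from itertools import combinations


def infer_links(links, types):
    """links: iterable of (source, target, justification) triples.
    types: dict mapping each URI to its conceptual type.
    Returns {(a, b, justification): "asserted" | "inferred"}.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    result = {}
    for source, target, justification in links:
        if types[source] == types[target]:
            # Same conceptual type: eligible for transitive chaining.
            # Nodes are (uri, justification) pairs, so chains never mix reasons.
            parent[find((source, justification))] = find((target, justification))
        else:
            # Crossing types: keep the asserted link, never chain through it.
            result[(source, target, justification)] = "asserted"

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)

    asserted = set(links)
    for members in groups.values():
        for (a, j), (b, _) in combinations(sorted(members), 2):
            direct = (a, b, j) in asserted or (b, a, j) in asserted
            result[(a, b, j)] = "asserted" if direct else "inferred"
    return result


example_links = [
    ("cs:A", "chembl:A", "InChI"),
    ("chembl:A", "db:A", "InChI"),
    ("gene:X", "prot:X", "encodes"),
]
example_types = {
    "cs:A": "compound", "chembl:A": "compound", "db:A": "compound",
    "gene:X": "gene", "prot:X": "protein",
}
closure = infer_links(example_links, example_types)
```

In this toy input the only new link is between `cs:A` and `db:A`; the gene-to-protein link survives but does not pull compounds into the same equivalence set.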

                  Initial Connectivity    Inferred Connectivity
Datasets          37                      37
Linksets          104                     883
Links             7,096,712               17,383,846
Justifications    7                       7

BridgeDb


http://ops.rsc.org/OPS45975 http://ops.rsc.org/OPS45978

has_isotopically_unspecified_parent [CHEMINF:000459]

has OPS normalized counterpart [CHEMINF:000458]

http://ops.rsc.org/OPS45991

is_tautomer_of [chebi:is_tautomer_of]

http://ops.rsc.org/OPS45987

has_stereoundefined_parent [CHEMINF:000456]

http://ops.rsc.org/OPS45981

Lenses


Input pattern:

GRAPH <http://rdf.chemspider.com> {
  cw:979b545d-f9a9 cheminf:logd ?logd .
}

Expanded pattern:

?iri cheminf:logd ?logd .
FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945)

Query Expansion

Query Expander Service: takes Q, L1 and returns Q’

Identity Mapping Service (BridgeDb), with Profiles and Mappings: takes cw:979b545d-f9a9, L1 and returns [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]

Can also be achieved through UNION
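Both expansion styles can be sketched as plain string rewriting in Python. This is an illustration under stated assumptions, not the Open PHACTS implementation: `MAPPINGS`, `map_uri`, `expand_filter` and `expand_union` are hypothetical names, the mapping service is replaced by a dictionary lookup, and prefix declarations are assumed to exist elsewhere in the query. The URIs and the lens name L1 are the ones shown on the slide.

```python
# Stand-in for the identity mapping service: (URI, lens) -> co-referent URIs.
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}


def map_uri(uri, lens):
    """Look up co-referent URIs; fall back to the URI itself if none are known."""
    return MAPPINGS.get((uri, lens), [uri])


def expand_filter(graph, uri, predicate, var, lens):
    """Replace the subject with ?iri and constrain it with a FILTER."""
    condition = " || ".join(f"?iri = {u}" for u in map_uri(uri, lens))
    return (
        f"GRAPH <{graph}> {{\n"
        f"  ?iri {predicate} {var} .\n"
        f"  FILTER ({condition})\n"
        f"}}"
    )


def expand_union(graph, uri, predicate, var, lens):
    """Emit one UNION branch per co-referent URI."""
    branches = "\n  UNION\n".join(
        f"  {{ {u} {predicate} {var} . }}" for u in map_uri(uri, lens)
    )
    return f"GRAPH <{graph}> {{\n{branches}\n}}"


filter_query = expand_filter(
    "http://rdf.chemspider.com", "cw:979b545d-f9a9", "cheminf:logd", "?logd", "L1"
)
union_query = expand_union(
    "http://rdf.chemspider.com", "cw:979b545d-f9a9", "cheminf:logd", "?logd", "L1"
)
```

Printing `filter_query` and `union_query` shows the two rewritten graph patterns side by side; changing the lens argument would change only the list returned by the mapping lookup.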


Experiment

Is it feasible to use a stand-off mapping service?
• Baselines (no external call):
  – “Perfect” URIs
  – Linked data querying
• Expansion approaches (external service call):
  – FILTER by Graph
  – UNION by Graph

C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf

“Perfect” URI Baseline

WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    chembl_mol:m1280 cheminf:mw ?mw .
  }
}

Linked Data Baseline

WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    ?chemblid cheminf:mw ?mw .
  }
  cs:2157 skos:exactMatch ?chemblid .
}
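The difference between the two baselines is one extra join through the skos:exactMatch link. A toy Python sketch makes that explicit, assuming a tiny in-memory list of quads in place of the triple store. The `QUADS` data, the helper names and the placeholder property values are invented for illustration; only the URIs, graph names and predicates come from the two queries.

```python
DEFAULT = None  # marker for the default graph

# (graph, subject, predicate, object); the values are placeholders, not real data.
QUADS = [
    ("chemspider", "cs:2157", "cheminf:logp", "<logp value>"),
    ("chembl", "chembl_mol:m1280", "cheminf:mw", "<mw value>"),
    (DEFAULT, "cs:2157", "skos:exactMatch", "chembl_mol:m1280"),
]


def objects(graph, subject, predicate):
    """Return every object matching a fully specified graph/subject/predicate."""
    return [o for g, s, p, o in QUADS if (g, s, p) == (graph, subject, predicate)]


def perfect_uri_baseline():
    # Each graph is queried with the URI that graph itself uses.
    return [
        (logp, mw)
        for logp in objects("chemspider", "cs:2157", "cheminf:logp")
        for mw in objects("chembl", "chembl_mol:m1280", "cheminf:mw")
    ]


def linked_data_baseline():
    # The ChEMBL URI is discovered by following the stored exactMatch link.
    return [
        (logp, mw)
        for logp in objects("chemspider", "cs:2157", "cheminf:logp")
        for chembl_id in objects(DEFAULT, "cs:2157", "skos:exactMatch")
        for mw in objects("chembl", chembl_id, "cheminf:mw")
    ]
```

Both functions return the same rows on this toy data; the second simply pays for the additional lookup, which is the cost the linked data baseline measures.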

Queries

Drawn from Open PHACTS API:
1. Simple compound information (1)
2. Compound information (1)
3. Compound pharmacology (M)
4. Simple target information (1)
5. Target information (1)
6. Target pharmacology (M)

Experiment Data

Data: 167,783,592 triples

Mappings: 2,114,584 triples

Lenses: 1

[Figures: average execution times per query, with a detailed chart for Q6: Target Pharmacology]

Conclusions

• Computing co-reference advantageous
  – Requires fewer raw linksets
  – Larger coverage across datasets
• Rules ensure control
  – Genes can equal proteins
  – Compounds never equal proteins
• Provenance captured throughout


• Query expansion slower in general
  – Due to separate service call
  – Difference below human perception
  – UNION faster than FILTER on Virtuoso
• Stand-off mappings feasible
• Infrastructure can support lenses


Questions

A.J.G.Gray@hw.ac.uk
www.alasdairjggray.co.uk
@gray_alasdair

pmu@openphacts.org
www.openphacts.org
@open_phacts
