Scientific Lenses over Linked Data: An approach to support multiple integrated views
Alasdair J G Gray
alasdairjggray.co.uk
@gray_alasdair


DESCRIPTION

When are two entries about a concept in different datasets the same? When they share a name, share properties, or meet some other criterion? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data, with no way of varying the notion of equivalence that is applied. In this presentation, I will introduce Scientific Lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform, a large-scale data integration platform for drug discovery. To cater for different use cases, the platform allows different lenses to be applied; each lens varies the equivalence rules based on the context and interpretation of the links.


Page 2: Scientific Lenses over Linked Data An approach to support multiple integrated views

Open PHACTS Use Case

“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”

Chemical Properties (ChemSpider)
Launched drugs (DrugBank)
Human => Mouse (HomoloGene)
Protein Families (ENZYME)
Bioactivity Data (ChEMBL)
… other info (UniProt/Entrez etc.)

16 October 2014 Scientific Lenses – A. J. G. Gray 1

Page 3: Scientific Lenses over Linked Data An approach to support multiple integrated views

Discovery Platform

Drug Discovery Platform
Apps
Domain API
Method Calls
Interactive responses
Production quality integration platform

Page 4: Scientific Lenses over Linked Data An approach to support multiple integrated views

App Ecosystem
An “App Store”?

http://www.openphactsfoundation.org/apps.html

Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium

MOE Collector Cytophacts Utopia Garfield SciBite

KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna


Page 5: Scientific Lenses over Linked Data An approach to support multiple integrated views

API Hits


April 2013 – March 2014: 15.8m

April 2014 – Sept 2014: 14m

Total: 29.8 million

Page 6: Scientific Lenses over Linked Data An approach to support multiple integrated views

Linked Data API

Drug
Disease (1.4)
Pathway
Target

https://dev.openphacts.org/

Page 7: Scientific Lenses over Linked Data An approach to support multiple integrated views

Open PHACTS Data

Source         Initial Records   Triples          Properties
ChEMBL         1,481,473         304,360,749      77
DrugBank       19,628            517,584          74
UniProt        564,246           405,473,138      82
ENZYME         6,187             73,838           2
ChEBI          40,575            1,673,863        2
GeneOntology   38,137            2,447,682        26
GOA            661,232           1,765,622,393    15
ChemSpider     1,361,568         215,193,441      23
ConceptWiki    2,828,966         4,291,131        1
WikiPathways   946               1,949,074        34

Page 8: Scientific Lenses over Linked Data An approach to support multiple integrated views

Dataset Descriptions in the Open Pharmacological Space

Being replaced by W3C HCLS community profile
http://tiny.cc/hcls-datadesc-ed

14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 7

Page 9: Scientific Lenses over Linked Data An approach to support multiple integrated views

OPS Discovery Platform

[Architecture diagram, top to bottom]
Apps
Linked Data API (RDF/XML, TTL, JSON)
Domain Specific Services
Core Platform: Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing
Data Cache (Virtuoso Triple Store)
Data sources (each shown as Db, Nanopub and/or VoID): Public Content, Commercial, Public Ontologies, User Annotations
Example identifiers handled: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

Page 10: Scientific Lenses over Linked Data An approach to support multiple integrated views

Multiple Identities

P12047
X31045
GB:29384

Are these the same thing?

Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/

Page 11: Scientific Lenses over Linked Data An approach to support multiple integrated views

Gleevec®: Imatinib Mesylate

Sources shown: DrugBank, ChemSpider, PubChem
Labels shown: Imatinib; Mesylate; Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N

Page 12: Scientific Lenses over Linked Data An approach to support multiple integrated views


Are these records the same?

It depends upon your task!

Page 13: Scientific Lenses over Linked Data An approach to support multiple integrated views

BRCA1: Chromosome 17
Breast cancer type 1 susceptibility protein

Genes == Proteins?

http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png

Page 14: Scientific Lenses over Linked Data An approach to support multiple integrated views


Are these records the same?

It depends upon your task!

Page 15: Scientific Lenses over Linked Data An approach to support multiple integrated views

Example Use Cases

I need to perform an analysis, give me details of the active compound in Gleevec.

Which targets are known to interact with Gleevec?

Page 16: Scientific Lenses over Linked Data An approach to support multiple integrated views

Structure Lens

I need to perform an analysis, give me details of the active compound in Gleevec.

Strict Relaxed
Analysing Browsing

skos:exactMatch (InChI)

Page 17: Scientific Lenses over Linked Data An approach to support multiple integrated views

Name Lens

Which targets are known to interact with Gleevec?

Strict Relaxed
Analysing Browsing

skos:closeMatch (Drug Name)
skos:closeMatch (Drug Name)
skos:exactMatch (InChI)
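
The Structure Lens and the Name Lens differ only in which link justifications they accept: the strict lens follows skos:exactMatch links justified by InChI, while the relaxed lens also follows skos:closeMatch links justified by drug name. A minimal Python sketch of that idea follows. The `Link` class, the `LENSES` table and the `ex:` record identifiers are illustrative assumptions, not part of the Open PHACTS platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. "skos:exactMatch"
    justification: str  # e.g. "InChI" or "Drug Name"


# Hypothetical lenses: each accepts a set of (predicate, justification) pairs.
LENSES = {
    "structure": {("skos:exactMatch", "InChI")},
    "name": {("skos:exactMatch", "InChI"), ("skos:closeMatch", "Drug Name")},
}


def equivalents(start, links, lens):
    """Return the identifiers directly linked to `start` under the given lens."""
    accepted = LENSES[lens]
    found = {start}
    for link in links:
        if (link.predicate, link.justification) not in accepted:
            continue  # this lens does not treat the link as an equivalence
        if link.source == start:
            found.add(link.target)
        elif link.target == start:
            found.add(link.source)
    return found


# Placeholder identifiers, not real database records.
links = [
    Link("ex:recordA", "ex:recordB", "skos:exactMatch", "InChI"),
    Link("ex:recordA", "ex:recordC", "skos:closeMatch", "Drug Name"),
]

strict_view = equivalents("ex:recordA", links, "structure")
relaxed_view = equivalents("ex:recordA", links, "name")
```

Under these assumptions the strict view contains only the structure-identical record, while the relaxed view also pulls in the record linked by name.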

Page 18: Scientific Lenses over Linked Data An approach to support multiple integrated views

What is a Scientific Lens?

A lens defines a conceptual view over the data

Specifies operational equivalence conditions

Consists of:

Identifier (URI)

Title (dct:title)

Description (dct:description)

Documentation link (dcat:landingPage)

Creator (pav:createdBy)

Timestamp (pav:createdOn)

Equivalence rules (bdb:linksetJustification)
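
As a rough illustration of the lens description listed above, the sketch below models a lens as a Python record and serialises it to (subject, predicate, object) tuples using the properties named on the slide. The class name, the example URIs and the example values are assumptions made for illustration; only the property names come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Lens:
    uri: str                   # Identifier (URI)
    title: str                 # dct:title
    description: str           # dct:description
    landing_page: str          # dcat:landingPage
    created_by: str            # pav:createdBy
    created_on: str            # pav:createdOn, as an ISO 8601 string
    justifications: List[str]  # bdb:linksetJustification values

    def to_triples(self) -> List[Tuple[str, str, str]]:
        """Serialise the lens description as simple string triples."""
        triples = [
            (self.uri, "dct:title", self.title),
            (self.uri, "dct:description", self.description),
            (self.uri, "dcat:landingPage", self.landing_page),
            (self.uri, "pav:createdBy", self.created_by),
            (self.uri, "pav:createdOn", self.created_on),
        ]
        triples += [
            (self.uri, "bdb:linksetJustification", j) for j in self.justifications
        ]
        return triples


# All values below are hypothetical placeholders.
example = Lens(
    uri="http://example.org/lens/Default",
    title="Default",
    description="Follows only links justified by identical structure.",
    landing_page="http://example.org/docs/lens/Default",
    created_by="http://example.org/people/someone",
    created_on="2014-10-16T00:00:00Z",
    justifications=[
        "http://example.org/justification/inchi",
        "http://example.org/justification/database-xref",
    ],
)
triples = example.to_triples()
```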


Page 19: Scientific Lenses over Linked Data An approach to support multiple integrated views

Lens Effects: Ibuprofen

CHEMBL427526
CHEMBL521
CHEMBL175

Ibuprofen consists of two equally active stereoisomers.
• Stereoisomers not always represented in data
Users wish to retrieve information for any stereoisomer.

Page 20: Scientific Lenses over Linked Data An approach to support multiple integrated views

Default Lens


Page 21: Scientific Lenses over Linked Data An approach to support multiple integrated views

Stereoisomer Lens


Page 22: Scientific Lenses over Linked Data An approach to support multiple integrated views

Mapping Generation

ops:OPS437281
ops:OPS380297
has_stereoundefined_parent [ci:CHEMINF_000456]
ops:OPS380292
is_stereoisomer_of [ci:CHEMINF_000461]

Other relationships

• has part

• is tautomer of

• uncharged counterpart

• isotope
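
One way to picture mapping generation is as a pass over structure relationships that emits a link carrying the relationship's CHEMINF term as its justification. The Python sketch below assumes the relationships arrive as (subject, relation, object) tuples; the function name and the `ex:` identifiers are hypothetical, and only the two CHEMINF terms shown on the slide are included.

```python
# Relationship name -> justification term, as given on the slide.
RELATION_JUSTIFICATIONS = {
    "has_stereoundefined_parent": "ci:CHEMINF_000456",
    "is_stereoisomer_of": "ci:CHEMINF_000461",
}


def generate_links(relationships):
    """Turn structure relationships into links annotated with a justification.

    Relationships without a known justification term are skipped, so every
    emitted link can later be accepted or rejected by a lens.
    """
    links = []
    for subject, relation, obj in relationships:
        justification = RELATION_JUSTIFICATIONS.get(relation)
        if justification is None:
            continue
        links.append(
            {"source": subject, "target": obj, "justification": justification}
        )
    return links


# Placeholder compound identifiers, not real OPS records.
relationships = [
    ("ex:compound1", "is_stereoisomer_of", "ex:compound2"),
    ("ex:compound1", "has_stereoundefined_parent", "ex:compound3"),
    ("ex:compound1", "some_unlisted_relation", "ex:compound4"),
]
generated = generate_links(relationships)
```

The other relationships listed (has part, is tautomer of, uncharged counterpart, isotope) would need their own justification terms, which the slide does not give.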

Page 23: Scientific Lenses over Linked Data An approach to support multiple integrated views

Initial Connectivity


Datasets 37

Linksets 104

Links 7,096,712

Justifications 7

Page 24: Scientific Lenses over Linked Data An approach to support multiple integrated views

Compound Information

Page 25: Scientific Lenses over Linked Data An approach to support multiple integrated views

Proceed with Caution!


Page 26: Scientific Lenses over Linked Data An approach to support multiple integrated views

Co-reference Computation

Rules ensure
Unrestricted transitivity within conceptual type
Restrict crossing conceptual types
Based on justifications
Provenance captured
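
The rules above can be read as a graph traversal: links are followed freely between identifiers of the same conceptual type, while a link that crosses types is followed only if its justification is allowed. The Python sketch below is one possible reading of those rules, not the platform's actual algorithm. The function name, the type labels, the justification labels and the `ex:` identifiers are all assumptions, and provenance capture is omitted.

```python
from collections import deque


def coreferent(start, links, types, allowed_cross):
    """Collect identifiers reachable from `start` under the transitivity rules.

    links:         iterable of (a, b, justification) tuples, treated as symmetric
    types:         dict mapping identifier -> conceptual type
    allowed_cross: justifications permitted to cross conceptual types
    """
    neighbours = {}
    for a, b, justification in links:
        neighbours.setdefault(a, []).append((b, justification))
        neighbours.setdefault(b, []).append((a, justification))

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other, justification in neighbours.get(node, []):
            if other in seen:
                continue
            same_type = types[node] == types[other]
            if same_type or justification in allowed_cross:
                seen.add(other)
                queue.append(other)
    return seen


# Placeholder identifiers and labels.
links = [
    ("ex:protein1", "ex:protein2", "protein-match"),
    ("ex:protein2", "ex:gene1", "gene-protein"),
    ("ex:gene1", "ex:gene2", "gene-match"),
]
types = {
    "ex:protein1": "protein",
    "ex:protein2": "protein",
    "ex:gene1": "gene",
    "ex:gene2": "gene",
}

within_type = coreferent("ex:protein1", links, types, allowed_cross=set())
across_types = coreferent("ex:protein1", links, types, allowed_cross={"gene-protein"})
```

With no cross-type justification allowed the traversal stays among proteins; allowing the gene-protein justification lets it continue into the gene identifiers.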



Page 28: Scientific Lenses over Linked Data An approach to support multiple integrated views

Inferred Connectivity


Datasets 37

Linksets 883

Links 17,383,846

Justifications 7

Page 29: Scientific Lenses over Linked Data An approach to support multiple integrated views

BridgeDb


Page 30: Scientific Lenses over Linked Data An approach to support multiple integrated views

Lenses: Under the hood

Query Expander Service: Q, L1 => Q’

Q:
    cw:979b545d-f9a9 cheminf:logd ?logd .

Identity Mapping Service (BridgeDb), using Profiles and Mappings:
    cw:979b545d-f9a9, L1 =>
    [cw:979b545d-f9a9,
     cs:2157,
     chembl:1280,
     db:db00945]

Q’:
    GRAPH <http://rdf.chemspider.com> {
        ?iri cheminf:logd ?logd .
        FILTER (?iri = cw:979b545d-f9a9 ||
                ?iri = cs:2157 ||
                ?iri = chembl:1280 ||
                ?iri = db:db00945 )
    }
    GRAPH <http://…

• Can also be achieved through UNION
• IMS call adds overhead
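
The FILTER-based expansion can be sketched in a few lines of Python: a fixed subject is replaced by a variable, and a FILTER restricts that variable to the co-referent URIs returned for the chosen lens. In this sketch the Identity Mapping Service call is replaced by a hard-coded dictionary, and the function names and the `L1` key are illustrative assumptions. The abbreviated URIs are the ones shown on the slide.

```python
# Stand-in for the Identity Mapping Service: (uri, lens) -> co-referent URIs.
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9",
        "cs:2157",
        "chembl:1280",
        "db:db00945",
    ],
}


def filter_clause(variable, uris):
    """Build a SPARQL FILTER that restricts `variable` to the given URIs."""
    conditions = " || ".join(f"{variable} = {uri}" for uri in uris)
    return f"FILTER ({conditions})"


def expand_pattern(predicate, obj, mapped_uris, variable="?iri"):
    """Rewrite `<uri> predicate obj .` as a variable pattern plus a FILTER."""
    pattern = f"{variable} {predicate} {obj} ."
    return pattern + "\n" + filter_clause(variable, mapped_uris)


mapped = MAPPINGS[("cw:979b545d-f9a9", "L1")]
expanded = expand_pattern("cheminf:logd", "?logd", mapped)
print(expanded)
```

A real expander would also need to place the rewritten pattern inside the right GRAPH block, which this sketch leaves out.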


Page 31: Scientific Lenses over Linked Data An approach to support multiple integrated views

Experiment

Is it feasible to use a stand-off mapping service?

Baselines (no external call):
• “Perfect” URIs
• Linked data querying

Expansion approaches (external service call):
• FILTER by Graph
• UNION by Graph
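
For comparison with the FILTER approach, a UNION-based expansion emits one copy of the triple pattern per co-referent URI and joins the copies with UNION. The Python sketch below shows only that string rewriting. The function name is hypothetical, the URIs are abbreviated placeholders in the style used elsewhere in the deck, and the per-graph placement implied by "by Graph" is not modelled.

```python
def expand_with_union(predicate, obj, mapped_uris):
    """Build one `{ <uri> predicate obj . }` group per URI, joined by UNION."""
    branches = [f"{{ {uri} {predicate} {obj} . }}" for uri in mapped_uris]
    return "\nUNION\n".join(branches)


# Abbreviated placeholder URIs.
union_query = expand_with_union(
    "cheminf:logp", "?logp", ["cs:2157", "chembl:1280", "db:db00945"]
)
print(union_query)
```

Each branch binds the same variable, so the result set matches the FILTER form; which form runs faster depends on the triple store, which is what the experiment sets out to measure.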

C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf

Page 32: Scientific Lenses over Linked Data An approach to support multiple integrated views

“Perfect” URI Baseline

WHERE {
    GRAPH <chemspider> {
        cs:2157 cheminf:logp ?logp .
    }
    GRAPH <chembl> {
        chembl_mol:m1280 cheminf:mw ?mw .
    }
}

Page 33: Scientific Lenses over Linked Data An approach to support multiple integrated views

Linked Data Baseline

WHERE {
    GRAPH <chemspider> {
        cs:2157 cheminf:logp ?logp .
    }
    GRAPH <chembl> {
        ?chemblid cheminf:mw ?mw .
    }
    cs:2157 skos:exactMatch ?chemblid .
}

Page 34: Scientific Lenses over Linked Data An approach to support multiple integrated views

Queries

Drawn from Open PHACTS API:

1. Simple compound information (1)

2. Compound information (1)

3. Compound pharmacology (M)

4. Simple target information (1)

5. Target information (1)

6. Target pharmacology (M)



Page 36: Scientific Lenses over Linked Data An approach to support multiple integrated views

Experiment Data

Data: 167,783,592 triples
Mappings: 2,114,584 triples
Lenses: 1

Page 37: Scientific Lenses over Linked Data An approach to support multiple integrated views

Average execution times


Page 39: Scientific Lenses over Linked Data An approach to support multiple integrated views

Q6: Target Pharmacology

Page 40: Scientific Lenses over Linked Data An approach to support multiple integrated views

Explorer Screenshot



Page 42: Scientific Lenses over Linked Data An approach to support multiple integrated views

Conclusions

Scientific data is complex and messy
Requires flexibility in linking
Equivalence depends upon context
Lenses provide support for operational equivalence
Chemical structures support automatic computing of links with justification

Page 43: Scientific Lenses over Linked Data An approach to support multiple integrated views

Acknowledgements

Royal Society of Chemistry

Colin Batchelor

Karen Karapetyan

Jon Steele

Valery Tkachenko

Antony Williams

University of Manchester

Christian Brenninkmeijer

Ian Dunlop

Carole Goble

Steve Pettifer

Robert Stevens

Swiss Institute for Bioinformatics

Christine Chichester

European Bioinformatics Institute

Mark Davies

Anna Gaulton

John Overington

University of Vienna

Daniela Digles

Maastricht University

Chris Evelo

Andra Waagmeester

Egon Willighagen

VU University of Amsterdam

Paul Groth

Antonis Loizou

Connected Discovery

Lee Harland


Page 44: Scientific Lenses over Linked Data An approach to support multiple integrated views

Questions

Alasdair J G Gray
alasdairjggray.co.uk
@gray_alasdair

Open PHACTS
openphacts.org
@open_phacts